White Wine Quality - Exploratory Data Analysis

Adittya September 2018

Dataset Overview

Wine quality, as Maynard Amerine once said, is easier to detect than define [1]. This is partially due to quality being primarily subjective, and strongly influenced by extrinsic factors. The quality of wine is the result of a complex set of interactions, which include geological and soil variables, climate, and many viticultural decisions. Most serious wine connoisseurs tend to agree on what constitutes wine quality, that is, what they subjectively have come to like through extensive tasting.

In this exploratory analysis, we try to determine which chemical properties influence the quality of white wines.

The dataset used covers white wines from Portugal [2].

For each wine, an expert-derived quality score on a scale of 0 (very bad) to 10 (very excellent) was provided. In addition to the quality score, the dataset contains measurements of 11 chemical properties of each wine, as described later.

First, let's look at the data provided in the dataset.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are 4,898 observations, and each observation has 12 variables of interest (excluding the column X, which is simply a sequential count of the observations).

Univariate Plots Section

In this section, we are going to use univariate plots to analyze the data. First, let’s take a look at the variables’ meaning and some descriptive statistics.

There are 11 chemical properties (e.g. fixed acidity, volatile acidity, etc.) and 1 measure of quality. The main feature in the dataset is ‘quality’, and it is the variable that we would like to predict.

The first thing we need to understand is that quality (our response variable) is scored between 0 (very bad) and 10 (very excellent).

Before going further, we need to drop the ‘X’ variable and check whether there are any missing values in the dataset.

f.acidity  v.acidity  citric.acid  res.sugar  chlorides  free.sulfur  total.sulfur  density  pH  sulphates  alcohol  quality
        0          0            0          0          0            0             0        0   0          0        0        0

All columns report a value of 0 and hence there are no missing values.
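The two preparation steps described above can be sketched in base R. This is a minimal sketch rather than the report's original code: the data frame holds only the first five rows of a few columns from the structure listing, and the object names `wine` and `missing_counts` are assumptions.

```r
# First five rows of a few columns, copied from the structure listing above.
# The full analysis would use the complete 4,898-row data frame instead.
wine <- data.frame(
  X              = 1:5,
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

# Drop the sequential row counter; it carries no chemical information.
wine$X <- NULL

# Count missing values per column; all zeros means the data are complete.
missing_counts <- colSums(is.na(wine))
print(missing_counts)
```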

A statistical summary of the data is given below:

Variable              Minimum         Mean     Median    Maximum      Variance
fixed.acidity         3.80000    6.8547877    6.80000   14.20000     0.7121136
volatile.acidity      0.08000    0.2782411    0.26000    1.10000     0.0101595
citric.acid           0.00000    0.3341915    0.32000    1.66000     0.0146458
residual.sugar        0.60000    6.3914149    5.20000   65.80000    25.7257702
chlorides             0.00900    0.0457724    0.04300    0.34600     0.0004773
free.sulfur.dioxide   2.00000   35.3080849   34.00000  289.00000   289.2427200
total.sulfur.dioxide  9.00000  138.3606574  134.00000  440.00000  1806.0854908
density               0.98711    0.9940274    0.99374    1.03898     0.0000089
pH                    2.72000    3.1882666    3.18000    3.82000     0.0228012
sulphates             0.22000    0.4898469    0.47000    1.08000     0.0130247
alcohol               8.00000   10.5142670   10.40000   14.20000     1.5144270
quality               3.00000    5.8779094    6.00000    9.00000     0.7843557

We can see that residual sugar and the two sulfur dioxide measures (free and total) have high variance, while density has the lowest variance.
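A summary table of this shape can be produced with a small helper. The sketch below is an illustration under assumptions, not the report's own code: the helper name `describe` is hypothetical, and it runs on only the first ten residual sugar and alcohol values from the structure listing, so its numbers will differ from the full-data table.

```r
# First ten values of two columns, copied from the structure listing above.
sample_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Hypothetical helper: one row per variable, one column per statistic.
describe <- function(df) {
  t(sapply(df, function(x) c(
    Minimum  = min(x),
    Mean     = mean(x),
    Median   = median(x),
    Maximum  = max(x),
    Variance = var(x)
  )))
}

stats <- describe(sample_rows)
print(round(stats, 4))
```

Because variance depends on each variable's units, a scale-free measure such as the coefficient of variation (standard deviation divided by the mean) may be a fairer basis for comparing spread across variables.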

Additionally, the quality column consists of integer values, with a minimum rating of 3 and a maximum rating of 9. It is interesting to note that none of the wines received either a perfect quality score (10) or the lowest quality score (0). Since quality is a categorical variable whose possible values are ordered, we need to transform it into an ordered categorical variable.

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

We can see that there are very few samples for quality level 3 (only 20) and even fewer samples for quality level 9 (only 5). This dearth of data at very low and very high quality scores might make it difficult to draw any statistically significant conclusions about the extremes of the quality scale.
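The conversion to an ordered categorical variable can be sketched as follows. The quality vector is rebuilt from the counts printed above, so no data file is needed; the object names are assumptions.

```r
# Rebuild the quality column from the frequency table shown above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Ordered factor: the levels keep their natural ranking 3 < 4 < ... < 9.
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)
print(table(quality_factor))

# Ordered factors support comparisons against a level.
n_top <- sum(quality_factor >= "8")

# Share of wines rated 5, 6 or 7.
share_middle <- mean(quality %in% 5:7)
print(c(n_top = n_top, share_middle = round(share_middle, 3)))
```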

Histograms

Histograms are among the most commonly used graphs for showing frequency distributions. We will split the 12 features into two groups of histograms (one based on acidity, and the second based on other factors).

The above four parameters (which are all related to acidity) are all approximately normally distributed with some positive skew.

Looking at the other histograms, we can observe that residual sugar is not normally distributed and is very highly skewed, and alcohol appears to have a trimodal distribution. The rest of the plots show approximately normal distributions with small positive skewness.

In the next section, we will exclude the top 1% of values for these parameters and apply a transformation if necessary.

Box Plots

Box plots are a great way to show the distribution, variance and outliers.

From the boxplots, we can see that Fixed Acidity, Volatile Acidity, Citric Acid, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, pH and Sulphates have many outliers.

The quality rating seems to be approximately normally distributed, with most of the ratings in the middle bins (5, 6 and 7).

Dealing with outliers

Since we have outliers, it makes sense to plot the data again with the top percentile removed, to see if there are any changes in the distribution. In this case, the values in the top 1% appear to be the outliers. We can ignore the quality variable for this plot, as the rating of 9 would be misrepresented as an outlier.

Let us concentrate on the variables volatile.acidity, citric.acid, residual.sugar and chlorides, and look at their distributions once the top 1% of values is removed. We can see that all these variables have an approximately normal distribution, apart from residual sugar, which seems to have a bi-modal distribution.
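Removing the top 1% of a variable amounts to filtering on its 99th percentile. The sketch below shows one way to do this in base R; it uses a synthetic right-skewed sample as a stand-in for the wine data, and the helper name `trim_top` is hypothetical.

```r
set.seed(42)
# Synthetic right-skewed stand-in for a variable such as chlorides;
# these are NOT values from the wine dataset.
x <- rlnorm(1000, meanlog = log(0.045), sdlog = 0.4)

# Hypothetical helper: keep only values at or below the given quantile.
trim_top <- function(values, prob = 0.99) {
  cutoff <- quantile(values, probs = prob, names = FALSE)
  values[values <= cutoff]
}

trimmed <- trim_top(x)
print(c(before = length(x), after = length(trimmed)))
print(summary(trimmed))
```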

Looking more closely at residual sugar, we can clearly see that the data has a bi-modal distribution once we apply a log transformation. It is inconclusive at this point whether this influences the quality of wine, but we can say that there are two different groups of wine (one with low residual sugar and the other with high residual sugar) or two different techniques of wine production. One possible reason for the bi-modal distribution is that there are two sets of white wines (dry and very dry) [5]. We will later explore whether residual sugar changes according to any of the other variables.

The alcohol content is interesting, possibly exhibiting bimodal or even trimodal behavior.

We can observe the sub-populations using a Gaussian mixture model [6]:
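A two-component Gaussian mixture can be fitted with the expectation-maximization (EM) algorithm. The sketch below implements EM by hand in base R so that it has no package dependencies; it is not the report's original code, the function name `fit_two_gaussians` is hypothetical, and the input is synthetic data standing in for a bi-modal variable such as log-transformed residual sugar.

```r
set.seed(1)
# Synthetic bi-modal sample (NOT the wine data): two groups of 300 points.
x <- c(rnorm(300, mean = 0.2, sd = 0.15), rnorm(300, mean = 1.0, sd = 0.15))

# Hypothetical helper: EM for a mixture of two univariate Gaussians.
fit_two_gaussians <- function(x, iterations = 200) {
  # Crude starting values: means of the lower and upper halves of the data.
  mu     <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
  sigma  <- rep(sd(x) / 2, 2)
  weight <- c(0.5, 0.5)
  for (i in seq_len(iterations)) {
    # E-step: posterior probability that each point belongs to each component.
    d1 <- weight[1] * dnorm(x, mu[1], sigma[1])
    d2 <- weight[2] * dnorm(x, mu[2], sigma[2])
    r1 <- d1 / (d1 + d2)
    r2 <- 1 - r1
    # M-step: re-estimate weights, means and standard deviations.
    weight <- c(mean(r1), mean(r2))
    mu     <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    sigma  <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
                sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  }
  list(weight = weight, mean = mu, sd = sigma)
}

fit <- fit_two_gaussians(x)
print(fit)
```

A fixed number of iterations keeps the sketch short; a fuller implementation would stop once the log-likelihood no longer improves.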

Creating new variables

I created three new variables for exploratory analysis. They were created because a few online articles [3, 4] suggested that residual sugar levels are adjusted based on the acidity levels of the wine.

Since there are three acidity-based variables (I ignored citric acid, as it contains a few zeroes, which lead to infinite ratios), let's visualize the distribution of the ratio of residual sugar to volatile.acidity, fixed.acidity and pH.
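The three ratios can be computed column-wise. This is a minimal sketch on the first five rows from the structure listing, assuming each ratio is residual sugar divided by the acidity measure; the variable names follow the report's summary output, and the data frame name `wine` is an assumption.

```r
# First five rows of the relevant columns, copied from the structure listing.
wine <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5),
  pH               = c(3, 3.3, 3.26, 3.19, 3.19)
)

# Assumed definition: residual sugar divided by each acidity measure.
wine$sugar.to.v.acidity <- wine$residual.sugar / wine$volatile.acidity
wine$sugar.to.f.acidity <- wine$residual.sugar / wine$fixed.acidity
wine$sugar.to.pH        <- wine$residual.sugar / wine$pH

# Dividing by zero gives Inf in R, which is why a column containing
# zeros (such as citric.acid) is unsuitable as a denominator.
ratio_with_zero <- 6.9 / 0
print(ratio_with_zero)

print(wine[, c("sugar.to.v.acidity", "sugar.to.f.acidity", "sugar.to.pH")])
```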

Summary statistics of the sugar to acidity ratios:

##  sugar.to.v.acidity sugar.to.f.acidity  sugar.to.pH     
##  Min.   :  0.884    Min.   :0.06897    Min.   : 0.1948  
##  1st Qu.:  7.500    1st Qu.:0.25610    1st Qu.: 0.5312  
##  Median : 17.945    Median :0.79136    Median : 1.6200  
##  Mean   : 25.082    Mean   :0.93589    Mean   : 2.0235  
##  3rd Qu.: 36.108    3rd Qu.:1.40432    3rd Qu.: 3.1321  
##  Max.   :134.545    Max.   :8.43590    Max.   :19.4100

From the visualizations below, we can see that they have a bi-modal or tri-modal distribution.

Univariate Analysis

  • The most important variable is our target variable: the quality of the wine. Of the 7 quality levels that appear in this dataset (3, 4, 5, 6, 7, 8 and 9), more than 90% of the data falls in 5, 6 and 7.

  • I observed that most of the data is approximately normally distributed with positive skewness. After removing the top percentile and applying a log transformation (for residual sugar), I observed that most of the features have a normal distribution, while residual sugar and alcohol have a bi-modal distribution.

  • It might be interesting to explore the ratio of residual sugar to acidity (pH, fixed.acidity, volatile.acidity) and its influence on quality, since I read articles [3, 4] indicating that sweetness reduces the sensation of acidity in wine, and this might be a factor in the quality of the wine.

  • Three new variables were created to explore the sugar to acidity ratio.

Bivariate Plots Section

Let us start the bivariate analysis by exploring the relationships between the variables. We will group the values by quality and see if there are any noticeable changes in the mean values of the other variables.

We can see that the mean values of alcohol and pH increase with quality, and the mean values of total.sulfur.dioxide and chlorides decrease as quality increases.
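Group means of this kind can be computed with `aggregate`. The sketch below runs on synthetic toy data, not the wine dataset: the values are generated so that alcohol rises and chlorides fall with quality, purely to show the mechanics, and the object names are assumptions.

```r
set.seed(7)
# Synthetic toy data, NOT the wine dataset: 20 rows per quality level,
# with alcohol built to rise and chlorides built to fall with quality.
toy <- data.frame(quality = rep(5:7, each = 20))
toy$alcohol   <- 7.5 + 0.5 * toy$quality + rnorm(60, sd = 0.5)
toy$chlorides <- 0.08 - 0.006 * toy$quality + rnorm(60, sd = 0.005)

# Mean of every other column within each quality level.
means_by_quality <- aggregate(. ~ quality, data = toy, FUN = mean)
print(means_by_quality)
```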

Correlation Plot & Scatterplot Matrix

Let us visualize the relationships with a correlation plot and scatterplot matrix.

It is very useful to highlight the most correlated variables in a data table. In this plot, the correlation coefficients are colored according to their value.

We will look at the scatterplots in two groups as there are too many variables (with both groups involving quality as it is our main interest). This is useful for looking at all possible two-way interactions or correlations between dimensions and will allow us to get a sense as to whether there are trends between various variables in the dataset.

  • The higher qualities (8 and especially 9) have less variance for most of the other variables.
  • There is a meaningful positive relationship between quality and alcohol.
  • Alcohol has a strong negative correlation and a linear relationship with density.
  • Residual sugar and density have a strong positive correlation.
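The correlations listed above come from a Pearson correlation matrix. The following minimal sketch computes one with `cor` on the first five rows from the structure listing; with so few rows the coefficients only illustrate the mechanics and should not be read as the full-data values. The object names are assumptions.

```r
# First five rows of three columns, copied from the structure listing
# (density is printed there with three decimals).
sample_rows <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(sample_rows)
print(round(cor_matrix, 2))
```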

We will take a look at some interesting relations:

1. Density and Residual Sugar

It can be observed that there is a strong positive relationship between density and residual sugar. As the amount of residual sugar increases, so does the density of the wine.

2. Alcohol and Density

There appears to be a strong negative relation between alcohol and density, as the density of the wine decreases with increase in alcohol content.

Quality and Sugar to acidity ratios

Let's explore whether the newly created variables (sugar to acidity ratios) have any influence on the quality of the wine.

All three graphs appear to share a common theme: higher quality wines have a lower sugar to acidity ratio. However, this is still inconclusive, as the medians move up and down (especially between ratings 3 and 5).

Quality with relation to Alcohol, Chlorides, Density and Volatile Acidity

From the correlation plot at the beginning of the section, we observed that alcohol, density, chlorides and volatile acidity may have some influence on the quality of the wine. Let us explore this further:

  • While there are a few ups and downs, there is a general increase in quality with increase in alcohol content.
  • The chloride levels appear to be decreasing at the higher quality levels.
  • The relationship between density and quality appears to be moderately strong, i.e. higher quality wines appear to have lower density than lower quality wines.
  • There does not seem to be a strong trend between quality and volatile acidity as the acidity levels fluctuate up and down through the different quality levels.

Since we know that residual.sugar has a bi-modal distribution, let us address each mode separately and see if they have any influence on the quality.

We can clearly see a difference between the two modal groups in relation to the quality of wine.
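Splitting a variable into its two modes can be done with `cut`. The sketch below uses a cut point of 4, the value this report uses when it discusses the two residual sugar groups; the group labels and object names are hypothetical, and the ten rows come from the structure listing, where every wine happens to be rated 6.

```r
# First ten residual sugar and quality values from the structure listing.
sample_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  quality        = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6)
)

# Assign each wine to the low (<= 4) or high (> 4) residual sugar group.
sample_rows$sugar.group <- cut(
  sample_rows$residual.sugar,
  breaks = c(-Inf, 4, Inf),
  labels = c("low (<= 4)", "high (> 4)")
)

# Group sizes and mean quality within each group.
group_sizes  <- table(sample_rows$sugar.group)
mean_quality <- tapply(sample_rows$quality, sample_rows$sugar.group, mean)
print(group_sizes)
print(mean_quality)
```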

Bivariate Analysis

We have found some strong and moderate relations between a few variables, as observed in the correlation plots and boxplots:

  • In the Density and Residual Sugar plot, we observed that the amount of residual sugar has a strong relation with the density. This is the strongest relation between the variables in this dataset.
  • In the Alcohol and Density plot, we visualized the strong negative correlation between Alcohol and Density.
  • From the quality boxplots, we observed that quality increases with increase in alcohol, while chlorides and density decrease as quality increases.
  • The sugar to acidity ratios do not have a strong relation with quality, and we will not explore this further.
  • For residual sugar between 1 and 4, we can see that quality increases as residual sugar increases. However, for residual sugar greater than 4, quality decreases as residual sugar increases. This could indicate that people have given high quality ratings to white wines with less residual sugar (1 to 4, usually found in very dry wines [5]). The other group, with more residual sugar (greater than 4; dry and sweeter wines), was rated differently.

Multivariate Plots Section

Based on the previous analysis, I selected a few interesting variables to explore if they influence the quality of the wine.

By isolating the high quality factors, we can see that higher quality wines have low density and low residual sugar.

By isolating the high quality factors, we can see that high quality wines have high alcohol content and low density.

By isolating the high quality factors, we can see that high quality wines have high alcohol content and low residual sugar.

There does not seem to be a relation between these variables.

Multivariate Analysis

From the multivariate plots, we can see that there are relationships between Alcohol, density, residual sugar and the quality. So we can build a linear regression model to predict the quality of the wine.

## 
## Call:
## lm(formula = quality ~ alcohol + density + residual.sugar, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5173 -0.5368 -0.0093  0.4739  3.1870 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     90.31292   12.37418   7.298 3.38e-13 ***
## alcohol          0.24587    0.01825  13.474  < 2e-16 ***
## density        -87.88589   12.31680  -7.135 1.11e-12 ***
## residual.sugar   0.05332    0.00509  10.476  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7873 on 4894 degrees of freedom
## Multiple R-squared:  0.2102, Adjusted R-squared:  0.2097 
## F-statistic: 434.1 on 3 and 4894 DF,  p-value: < 2.2e-16

Although all three coefficients are statistically significant, the model explains only about 21% of the variance in quality (R-squared of 0.2102), so these three variables on their own are weak predictors of quality.
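To make the regression output concrete, the fitted coefficients can be applied by hand to the first wine in the structure listing (alcohol 8.8, density 1.001 as printed there, residual sugar 20.7), and the adjusted R-squared can be recomputed from the reported R-squared. This is a worked illustration using the printed, rounded values; the object names are assumptions.

```r
# Coefficients copied from the regression summary above.
coefs <- c(intercept      = 90.31292,
           alcohol        = 0.24587,
           density        = -87.88589,
           residual.sugar = 0.05332)

# First wine in the structure listing (density as printed, i.e. rounded).
first_wine <- c(alcohol = 8.8, density = 1.001, residual.sugar = 20.7)

# Predicted quality = intercept + sum of coefficient * value.
predicted <- unname(coefs["intercept"] +
                    sum(coefs[names(first_wine)] * first_wine))
print(round(predicted, 2))  # about 5.61, against an observed rating of 6

# Adjusted R-squared from R-squared, n observations and p predictors.
r_squared <- 0.2102
n <- 4898
p <- 3
adjusted <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adjusted, 4))   # matches the reported 0.2097
```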

Transforming Quality to Low, Medium and High

As observed in the initial sections, the number of observations for qualities 3, 4 and 9 is very low. Given this low number of samples at the extremes of the quality spectrum, it is possible that the dataset is being partitioned too finely. We can try grouping the quality ratings into Low (3-5), Medium (6) and High (7-9).

Low Medium High
1640 2198 1060

Now we have more than 1000 observations under each category and this might help us to draw better conclusions.
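The regrouping can be sketched with `cut`. The quality vector below is rebuilt from the frequency table reported earlier, so the resulting counts can be checked against the Low, Medium and High totals above; the object names are assumptions.

```r
# Rebuild the quality column from the frequency table reported earlier.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed: (2,5] -> Low, (5,6] -> Medium, (6,9] -> High.
quality_group <- cut(
  quality,
  breaks = c(2, 5, 6, 9),
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE
)

group_counts <- table(quality_group)
print(group_counts)
```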

With this transformation, we can finally observe some consistent trends in the data:

  • Higher quality wines have more alcohol content
  • Higher quality wines have lower density
  • Higher quality wines contain less residual sugar

Final Plots and Summary

There are three final plots that I have chosen which can show the most interesting things that I have found.

Plot One

Description One

The correlation plot enabled us to concentrate on the variables that warranted further analysis.

Plot Two

Description Two

From this plot, we observed that residual sugar has two sub-populations (one with residual sugar of 4 or less and the other with more than 4). This indicates that there are different categories of white wine based on the amount of residual sugar, and that people prefer white wine with less residual sugar (very dry wines).

Plot Three

Description Three

This plot demonstrates that once wine quality is transformed into coarser bins (i.e. ‘Low’, ‘Medium’ and ‘High’ instead of the numbers 3-9), consistent trends emerge in the impact of various chemical properties on wine quality.

Reflection

It was an interesting exploration due to the number of variables. My initial assumption was that acidity and residual sugar (the sugar to acidity ratio) would have a huge impact on quality, which in the end was not true. I tried to incorporate different visualization libraries (rbokeh, ggplot, plotly), and it was a great learning experience. I struggled a lot to create layered plots in rbokeh using functions, and the documentation [7] was somewhat helpful in this regard. The output size has gotten out of hand (>15 MB) due to the use of too many visualization libraries. However, the size is justified, since this report is not meant for production use and interactivity was required in the bivariate and multivariate plots. If need be, I will try to reduce the size of the report.

From what I learned about mixture models [6], I could further explore this dataset’s bi-modal and tri-modal distributions and examine each mode’s relation to quality. I did not build any predictive models due to the late submission of this project, and I would love to apply statistical learning techniques to this dataset. It would also be interesting to see whether the red wine dataset returns similar results.

References

[1] Nature and Origins of Wine Quality - Ronald S. Jackson https://doi.org/10.1016/B978-0-12-801813-2.00008-2

[2] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

[3] Acidity in wine is important - Madeline Puckette https://winefolly.com/review/understanding-acidity-in-wine/

[4] Wine Jargon: What is Residual Sugar? - Steven Grubbs https://drinks.seriouseats.com/2013/04/wine-jargon-what-is-residual-sugar-riesling-fermentation-steven-grubbs.html

[5] 11 Types of Dry White Wine https://wine.lovetoknow.com/wiki/Types_of_Dry_White_Wine

[6] Using Mixture Models for Clustering http://tinyheero.github.io/2015/10/13/mixture-model.html

[7] Rbokeh http://hafen.github.io/rbokeh/